
Modeling Refugee Arrivals in Europe

1. Introduction

In September 2015, the British Red Cross (BRC) developed a model to anticipate refugee flows in Europe.[1] Such models are useful to aid organisations because of the insight they offer into the volume of refugees and the routes they travel. In turn, this allows for strategic allocation of goods (e.g. foodstuffs, blankets and medicines) and volunteers to the areas that are most likely to need them.

The model created by the BRC is based on count data[2] published by the United Nations High Commissioner for Refugees (UNHCR).[3] The results were generally regarded as positive but found to be more stable for countries that are situated closer to the final destination of refugees.[4] In this document, we perform several exploratory analyses aimed at improving our understanding of the UNHCR data and improving the initial model created by the BRC. To this end, we first outline the problem statement and goals. Then we perform an exploratory data analysis of the UNHCR data. Next, we construct several models aimed at outperforming the BRC model. We conclude with limitations and suggestions for follow-up projects.

2. Problem statement and project goals

The Leiden University Centre for Innovation (CFI) was asked to explore the UNHCR data and, if possible, improve on the original BRC model. Concretely, our goals were as follows:

  1. Gain a better understanding of the count data. That is, try to better understand its limitations.
  2. Outperform the BRC model in prediction accuracy and robustness.
  3. Find a way to predict refugee arrivals at least two days ahead.

This exploratory analysis is limited to the period between September 1, 2015 and March 14, 2016. This means that more recent events (e.g. the EU-Turkey deal) are not reflected in our dataset.

The exploratory data analysis for the UNHCR data is conducted on the entire dataset. For points 2 and 3 listed above, we decided to take one country (Austria) as a test case.

3. Exploratory Data Analysis

The data we use for our analysis comes straight from the UNHCR Google spreadsheet (see footnote 2). Figure 1 below gives an overview of the countries represented in the dataset.

Figure 1: Area of interest


3.1 Data quality

Looking at figure 2 below, we observe several discrepancies:

  1. The data is of varying quality between rows 0 and 50.
  2. The first four columns have almost no missing values, whereas the columns thereafter suggest varying data quality.
  3. Hungary is an outlier in that almost no refugees pass through the country after row 50 or so.
Figure 2: Data quality


In general, these dynamics are not out of the ordinary. Hungary closed its borders around October (see below) and the NA values for Slovenia, Croatia and Austria are likely due to the fact that the UNHCR started counting refugees at a later point in time in these areas.

3.3 Correlating arrivals between countries

The plots below are obtained as follows:

  1. Take the refugee arrivals for two countries (say, Greece and Macedonia)
  2. Define a vector of lags by which to shift the refugee arrivals in Greece. The idea is that if, say, traveling to Macedonia takes 2 days, refugee arrivals in Greece on e.g. Monday should correlate highly with refugee arrivals in Macedonia on Wednesday. Here, we lag from 0 days to 20 days.
  3. Repeat point 2 for every row (that is, for every day between September 1st 2015 and February 16th 2016).
  4. For each of these days, take the time lag for which the correlation is highest between the refugee arrivals in both countries.
  5. Plot this for each row.
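The lag search in the steps above can be sketched as follows. This is a minimal illustration in Python (the original analysis was done in R), assuming two equally long daily series; the function names `pearson` and `best_lag` and the toy numbers are ours and not part of the original analysis. The text does not specify which window of days is used when the search is repeated for every row, so the sketch only shows the search over one window.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def best_lag(upstream, downstream, max_lag=20):
    """Return (lag, correlation) maximising corr(upstream[t], downstream[t + lag])."""
    best = (0, float("-inf"))
    for lag in range(max_lag + 1):
        # Pair upstream day t with downstream day t + lag.
        x = upstream[: len(upstream) - lag]
        y = downstream[lag:]
        if len(x) < 3:
            break  # too few overlapping days to correlate
        r = pearson(x, y)
        if r > best[1]:
            best = (lag, r)
    return best

# Toy data (made up): "macedonia" repeats "greece" two days later.
greece = [120, 300, 80, 450, 200, 610, 90, 330, 500, 150, 270, 420]
macedonia = [0, 0] + greece[:-2]
lag, r = best_lag(greece, macedonia, max_lag=5)
print(lag, round(r, 3))
```

Applying `best_lag` to a moving window of days would give one optimal lag per day, which is the quantity plotted per row.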

KNN imputation (with k = 4) is used to fill in NA values.
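As a rough illustration of KNN imputation, the sketch below fills each missing value with the mean of that column over the k nearest days. It is written in Python for illustration only (the original analysis was done in R), and it rests on assumptions the text does not confirm: rows are days, columns are countries, and the distance between two days is computed over the countries observed on both. The function name `knn_impute` and the numbers are made up.

```python
from math import sqrt

def knn_impute(rows, k=4):
    """Fill None entries with the mean of the k nearest rows that observe that column."""
    def distance(a, b):
        # Root mean squared difference over columns observed in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return float("inf")
        return sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbours: other rows where column j is observed.
            candidates = sorted(
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            nearest = [v for d, v in candidates[:k] if d != float("inf")]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled

# Toy data (made up): rows are days, columns are three countries.
rows = [
    [100, 90, 80],
    [110, 95, None],   # missing count to be imputed
    [300, 280, 250],
    [105, 92, 82],
    [310, 290, 260],
    [98, 88, 78],
    [102, 91, 81],
]
filled = knn_impute(rows, k=4)
print(filled[1])
```

Distances are computed on raw counts here; in practice the columns would probably need to be scaled first so that high-volume countries do not dominate the neighbour search.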

These plots show several patterns:

  1. For almost every pair of countries, the correlation becomes weaker after day 50. This indicates the occurrence of one or multiple exogenous shocks (see points above).
  2. The plot for Greece - Croatia is the most volatile. This is not unexpected (look at the map above), because we don’t have data on the route between these two countries.
  3. Macedonia - Serbia may be stable due to the transportation the Macedonian government made available to refugees passing through its territory.

3.4 Regressing arrivals in Macedonia on lagged arrivals in Greece

To further check the robustness of the correlations above, we can define several linear models with different lags to see where/how they break down. This is done below for the relationship of refugee arrivals in Macedonia (dependent) and Greece (independent).

Eyeballing the plots below, the most stable relationship appears to be the one between Greek refugee arrivals lagged by 4 days and Macedonian refugee arrivals. This is not entirely in line with the correlation plot above, where a lag of 3 days was found to be optimal for roughly 120 days.

We can plot simple linear models for each of the relationships above and examine the residuals to gain a better understanding of the relationship. The model with a time lag of t = 4 indeed scores highest in terms of \(R^2\) (roughly 0.54). The beta coefficient should be interpreted as the expected number of additional arrivals in Macedonia for each additional refugee arriving in Greece (with the appropriate time lag); e.g. for a time lag of 4 days, each additional refugee arriving in Greece is associated with 0.87 additional arrivals in Macedonia 4 days later.
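To make the lagged regression concrete, the sketch below fits a one-predictor least-squares model of arrivals in one country on lagged arrivals in another and reports \(R^2\). It is a Python illustration under our own assumptions (the original models were fitted in R); the function name `fit_lagged_ols` and the data are made up, with the toy series constructed so that the slope equals the 0.87 quoted above.

```python
def fit_lagged_ols(x, y, lag):
    """Fit y[t] = intercept + beta * x[t - lag] by least squares.

    Returns (intercept, beta, r_squared).
    """
    # Align x on day t - lag with y on day t.
    xs = x[: len(x) - lag]
    ys = y[lag:]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((a - mean_x) ** 2 for a in xs)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    beta = sxy / sxx
    intercept = mean_y - beta * mean_x
    # R^2 = 1 - RSS / TSS on the aligned sample.
    rss = sum((b - (intercept + beta * a)) ** 2 for a, b in zip(xs, ys))
    tss = sum((b - mean_y) ** 2 for b in ys)
    return intercept, beta, 1 - rss / tss

# Toy data (made up): Macedonia receives 0.87 of Greek arrivals 4 days later, plus 10.
greece = [120, 300, 80, 450, 200, 610, 90, 330, 500, 150, 270, 420]
macedonia = [0, 0, 0, 0] + [10 + 0.87 * g for g in greece[:-4]]
intercept, beta, r2 = fit_lagged_ols(greece, macedonia, lag=4)
print(round(intercept, 2), round(beta, 2), round(r2, 3))
```

Fitting the same function for each candidate lag and comparing the resulting \(R^2\) values mirrors the comparison described above; on real data the fit will be far from perfect.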

The residuals show that the RSS is minimized for a time lag of 4 days. However, all models show strong signs of autocorrelation and heteroskedasticity. Given the nature of the data (i.e. time-series) and the lack of predictors, this is not unexpected.
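One common way to quantify autocorrelation in regression residuals is the Durbin-Watson statistic. The sketch below is a Python illustration of that diagnostic, not necessarily the check used for the analysis above; the residual values are made up.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic of a residual series.

    Values near 2 suggest little lag-1 autocorrelation, values near 0
    suggest strong positive autocorrelation, values near 4 strong negative.
    """
    num = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    den = sum(e ** 2 for e in residuals)
    return num / den

# Made-up residuals: one slowly drifting series, one that flips sign every day.
smooth = [5, 4, 4, 3, 1, -1, -3, -4, -4, -5]
alternating = [3, -3, 3, -3, 3, -3, 3, -3, 3, -3]
print(round(durbin_watson(smooth), 3), round(durbin_watson(alternating), 3))
```

Residuals from a time-series regression with few predictors often look like the slowly drifting series, which is consistent with the autocorrelation noted above.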

3.5 Conclusion

The UNHCR data shows interesting patterns with respect to refugee flows between countries. One obvious drawback of this data is that it consists of refugee counts only, which limits its ability to account for large exogenous shocks (e.g. the closing of borders). Nonetheless, the exploratory analyses show that there is some merit to the approach of lagging refugee arrivals such that we can follow their movements across countries. In the next section, we will follow this approach to predict refugee arrivals in Austria.

4. Predicting refugee arrivals in Austria

In this section, we will evaluate the performance of several models in predicting refugee arrivals in Austria. The first subsection outlines the data used for this exercise and describes the pre-processing steps we take to prepare the data for modeling. Then, we will look at the model created by Simon Johnson for the BRC. Finally, we evaluate our own model.

4.1 Data, pre-processing and evaluation metrics

We use a specific subset of the UNHCR data for our model, which is engineered to follow Simon Johnson’s pre-processing steps as accurately as possible.[5] The time period of this data runs from 2015-10-20 to 2016-02-18. The period until 2016-01-01 is used to train our models. The remaining data is used for the purpose of evaluation.

To pre-process the data, we take refugee arrivals in Austria as the dependent (i.e. outcome) variable. The independent variables (i.e. features) consist of refugee arrivals in all other countries, lagged by one to seven days. As such, each country is represented seven times in our dataset. Table 1 below shows the schematic outline of this data.[6]

Table 1: Schematic outline of the data

Arrivals_Austria_today  Arrivals_greece_lag1  Arrivals_greece_lag2  Arrivals_greece_lag7  Arrivals_Macedonia_lag1  etc.
500                     250                   150                   700                   425
700                     550                   250                   890                   300
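The layout of this table can be produced with a few lines of code. The sketch below is a Python illustration (the original pre-processing was done in R) and assumes one equally long daily series per country; the function name `build_lagged_rows` and the toy values are ours.

```python
def build_lagged_rows(series, target, lags=range(1, 8)):
    """Return one dict per day: today's target value plus lagged values of all other series."""
    max_lag = max(lags)
    n_days = len(series[target])
    rows = []
    # The first max_lag days are dropped because their lagged values do not exist.
    for t in range(max_lag, n_days):
        row = {f"Arrivals_{target}_today": series[target][t]}
        for country, values in series.items():
            if country == target:
                continue
            for lag in lags:
                row[f"Arrivals_{country}_lag{lag}"] = values[t - lag]
        rows.append(row)
    return rows

# Toy data (made up): ten days of arrivals for three countries.
series = {
    "Austria": list(range(100, 110)),
    "Greece": list(range(0, 10)),
    "Macedonia": list(range(50, 60)),
}
rows = build_lagged_rows(series, target="Austria")
print(len(rows), len(rows[0]))
print(rows[0]["Arrivals_Austria_today"], rows[0]["Arrivals_Greece_lag1"])
```

With lags of one to seven days, each non-target country contributes seven columns, which matches the description above.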

All analyses are performed using the R statistical language. We use the caret package in R to train our models.[7] We use Leave-One-Out Cross-Validation (LOOCV) as our cross-validation technique. The results for all models are presented in a Shiny application which is available here. In this document, we only report on the model that we regard as being the best fit.
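For readers unfamiliar with LOOCV, the sketch below shows the idea: each observation is held out once, the model is fitted on the rest, and the held-out prediction errors are pooled. This is a generic Python illustration, not the caret code used for the analysis; `loocv_rmse`, `fit_mean` and `predict_mean` are hypothetical names, and the trivial mean model only serves to keep the example self-contained.

```python
from math import sqrt

def loocv_rmse(xs, ys, fit, predict):
    """Leave-one-out cross-validated RMSE for a model given as fit/predict callables."""
    squared_errors = []
    for i in range(len(xs)):
        # Hold out observation i and train on all the others.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        model = fit(train_x, train_y)
        squared_errors.append((ys[i] - predict(model, xs[i])) ** 2)
    return sqrt(sum(squared_errors) / len(squared_errors))

# Trivial stand-in model: always predict the mean of the training outcomes.
def fit_mean(train_x, train_y):
    return sum(train_y) / len(train_y)

def predict_mean(model, x):
    return model

score = loocv_rmse([0, 1, 2], [1, 2, 3], fit_mean, predict_mean)
print(round(score, 4))
```

In the setting above, the cross-validated error would be computed for each candidate value of a tuning parameter and the value with the lowest error retained.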

We use both the absolute error and the Root Mean Square Error (RMSE) as evaluation metrics for our models.
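Both metrics are straightforward to compute. The Python sketch below is illustrative only; the text does not state whether the absolute error is summed or averaged over the test days, so we assume a sum here, and the example counts are made up.

```python
from math import sqrt

def total_absolute_error(actual, predicted):
    """Sum of absolute differences (assumption: summed, not averaged)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted))

def rmse(actual, predicted):
    """Root mean square error."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Made-up daily arrivals and predictions.
actual = [500, 700, 650, 400]
predicted = [450, 760, 650, 430]
print(total_absolute_error(actual, predicted), round(rmse(actual, predicted), 2))
```

Because the RMSE squares the errors before averaging, it penalises a few large misses more heavily than the absolute error does, which is why reporting both is informative.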

The BRC model

The BRC model is a lasso regression. A lasso model is essentially an OLS linear regression model in which the size of the coefficients (the sum of their absolute values) is constrained by a budget determined by a tuning parameter \(\lambda\), which is chosen using cross-validation.[8] This causes some coefficients to shrink almost or completely to 0. As such, the lasso performs feature selection.
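For reference, the lasso estimate can be written as follows. This is the standard textbook formulation rather than anything specific to the BRC implementation; the symbols \(y_i\) (arrivals on day \(i\)), \(x_{ij}\) (lagged feature \(j\) on day \(i\)), \(\beta_j\), \(n\), \(p\) and the budget \(s\) are notation introduced here.

```latex
\[
\hat{\beta}^{\text{lasso}}
  = \underset{\beta_0,\,\beta}{\arg\min}
    \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2
  \quad \text{subject to} \quad
  \sum_{j=1}^{p} \lvert \beta_j \rvert \le s
\]

\[
\text{or equivalently} \quad
\hat{\beta}^{\text{lasso}}
  = \underset{\beta_0,\,\beta}{\arg\min}
    \Biggl\{ \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2
    + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \Biggr\}
\]
```

Each value of \(\lambda\) corresponds to some budget \(s\): a larger \(\lambda\) means a smaller budget, so more coefficients are set exactly to 0.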

The figure below shows the actual refugee count in Austria versus the predicted refugee count when predicting one day ahead. The absolute error for this model is 14865 and the RMSE is 536.

Despite being a linear model, the lasso performs relatively well on our test data. There are several days on which the model strongly over- or under-predicts the number of refugees. Moreover, it is mostly off by one day in picking up on sharp inclines and declines.

For a two-day prediction, the model does not perform worse in terms of absolute error and RMSE (absolute error 14197 and RMSE 522), but its predictions are more erratic.

Our model

Our final model is a boosted Generalized Additive Model (GAM).[9] Like the lasso, it performs variable selection. However, it is non-linear and thus better able to follow the fluctuations in refugee inflows (although this reduction in bias may also lead to an increase in variance).

The figure below shows the results for a one-day prediction. The absolute error of this model is 11537 and the RMSE is 417. As we can observe, the model follows sharp increases and decreases in the data quite accurately.

Like the original lasso model, the boosted GAM becomes more erratic when trying to predict more than one day ahead. Nonetheless, the model performs quite well (absolute error 11939 and RMSE 428), and is relatively good at following rapid inclines and declines in refugee flows.

5. Limitations, recommendations and conclusions

The preceding sections show that the BRC model can be improved by using a boosted non-linear model. However, the model has only been evaluated over one test period for one country. Additional testing should cover different time periods and different countries. Moreover, additional testing is needed to determine the appropriate ‘time window’ for the model (that is, the number of days used for prediction), and to determine how often the model should be retrained in order to incorporate new events.

In its current form, this model is not robust enough to base policy on. From a substantive point of view, we need more information about the direct needs of aid organisations and some measure of an acceptable error. This would help to avoid both situations in which there is an overabundance of supplies (bad, but not devastating) and situations in which there is a shortage of supplies (potentially disastrous).

Additionally, we believe that dashboards built on such models would benefit greatly from manual triggers that simulate e.g. closed border crossings. This allows for scenario analysis on the part of aid organisations. Combined with ancillary data such as weather forecasts, geographical information, natural landscape relief, political sentiment, and news, such dashboards can provide aid organisations with a powerful tool.

References


  1. For more information, see the GitHub repository and the corresponding dashboard.

  2. A copy of the data can be found here.

  3. See the UNHCR website for more information.

  4. For more information, see this article written by Simon Johnson of the BRC.

  5. See Simon Johnson’s GitHub repository for the original pre-processing steps.

  6. A copy of our dataset can be downloaded here.

  7. A copy of the R script used to train our models can be downloaded here.

  8. See ‘An Introduction to Statistical Learning’, esp. pp. 219-228.

  9. See ‘The Elements of Statistical Learning’, esp. pp. 295-304.

Jasper Ginn, Arvid Halma, Wouter Eekhout and Parisa Zahedi

Leiden University Centre for Innovation

2016-06-17